Red Wine Quality by Devin McCormack

This dataset tabulates the physicochemical properties, as well as subjective quality (1-10) of 1599 red wines from: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

The specific wine variety tested is a Portuguese wine called “Vinho Verde”. There are 11 attributes of these wines measured with physicochemical tests, with most measurements given in grams (or milligrams) per liter. The exceptions are pH, which is a unitless scale, and alcohol, which is given in percentage by volume.

These 11 attributes are measured in an attempt to model a “quality” score. Here quality is a median of at least 3 evaluations made by wine experts. Wine Quality is a complete sensory score influenced by the wine tasting process, which is generally laid out in the five “S” steps:

  1. See - the visual color and clarity of the wine is judged. This is not directly related to any of our measured components.

  2. Swirl - the density, related to alcohol and residual sugar is noted.

  3. Sniff - the “nose” of the wine is most important. From here a judge can assess the fermentation (acetic acid), alcohol, preservatives (sulfur dioxide), and exotic aromas (citric acid and others) of a wine.

  4. Sip - tasting the wine allows a judge to assess sweetness (residual sugar), tartness (acids and pH), salts (chlorides), and alcohol content. Additionally, various elements of smells change with temperature. Interestingly, a key component of taste is usually bitterness or tannin level, but the data does not reflect that this was physicochemically measured.

  5. Savor - the aftertaste of the wine more completely assesses alcohol and bitterness, as well as less volatile aromatics.

Now into data cleaning!

Data cleaning

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Based on the summary output, there are two things I want to change: dropping the unnecessary X column (it is just a row index running from 1 to 1599), and converting quality to a factor.

One thing worth noting right now is that the quality range isn’t 1-10. There are no absolutely terrible or perfect red wines in this list.

At this point it is important to note the units of these measurements. All input variables besides pH, density, the sulfur dioxides, and alcohol are in g/dm^3, aka grams per liter. Density is in g/cm^3, or grams per milliliter. This can be converted to g/dm^3 by multiplying by 1000. Alcohol is given in percent by volume. Knowing that the density of pure ethanol is 789 g/dm^3, we can also convert percent into g/dm^3: each percent by volume corresponds to 7.89 g/dm^3. Both sulfur dioxide measurements are in mg/dm^3, so we can divide by 1000 to get them into g/dm^3.

Finally, pH is a base-10 logarithmic scale of acidity, where lower values are more acidic. The different acids measured (i.e. tartaric, acetic, citric) have different acidity levels based on concentrations and other components in the water. The sum of these components should be negatively correlated with pH.

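library(dplyr)

# Drop the row-index column and treat quality as a categorical variable
wine <- select(wine, -X)
wine$quality <- as.factor(wine$quality)

# Convert to g/L: density from g/mL, alcohol from % by volume
# (pure ethanol is 789 g/L), and both SO2 measurements from mg/L
wine <- wine %>%
  mutate(density = density * 1000,
         alcoholgperL = alcohol * 7.89,
         free.sulfur.dioxide = free.sulfur.dioxide / 1000,
         total.sulfur.dioxide = total.sulfur.dioxide / 1000)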

head(wine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1               0.011                0.034   997.8 3.51      0.56     9.4
## 2               0.025                0.067   996.8 3.20      0.68     9.8
## 3               0.015                0.054   997.0 3.26      0.65     9.8
## 4               0.017                0.060   998.0 3.16      0.58     9.8
## 5               0.011                0.034   997.8 3.51      0.56     9.4
## 6               0.013                0.040   997.8 3.51      0.56     9.4
##   quality alcoholgperL
## 1       5       74.166
## 2       5       77.322
## 3       5       77.322
## 4       6       77.322
## 5       5       74.166
## 6       5       74.166

Univariate Plots Section

Quality Distribution

First a quick look at what we are trying to model, the quality score of the wine.

Here we see the distribution of the quality ratings. Once again, this shows that the range of values is quite constricted; all wines are rated 3-8, not 1-10. Note that the y axis is plotted on a log10 scale. This is because there are an order of magnitude fewer wines at the extremes than in the middle. It is rare for wines to be much better or worse than average.
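As a rough sketch of how such a plot could be drawn, the block below uses base R's `barplot` with a logarithmic y axis. The counts are not read from the raw file; they are the column sums of the model cross-tabulation shown in the modeling section later in this report, and the variable name `quality_counts` is just an illustrative choice.

```r
# Number of wines per quality score, taken from the column sums of the
# cross-tabulation in the modeling section (they add up to 1599).
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# A log10 y axis keeps the rare extreme scores visible next to the
# several hundred wines rated 5 or 6.
barplot(quality_counts, log = "y",
        xlab = "quality", ylab = "number of wines (log10 scale)")

# Ratio of the most common score to the rarest one
ratio_mid_to_extreme <- max(quality_counts) / min(quality_counts)
```

On a linear axis the bars for scores 3 and 8 would be barely visible, which is the reason for the log scale.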

Acids

Acids are important to wine, and the type and amount of acidity affects the smell, sip, and savor steps of winetasting.

Noteworthy is that all the wines have some amount of fixed acidity and volatile acidity, but not all wines have citric acid. Also noteworthy are the different scales of these graphs. Fixed acid is a measure of tartaric acid, which is a product of grapes - it makes sense that wine from grapes has a lot of it. Volatile acid is acetic acid, a component of vinegar, which is a byproduct of fermentation. Since wine is fermented, we expect some level of acetic acid, but no one wants particularly vinegary wine. Citric acid is related to citrus fruit, and is rarer to find in wine. However, it is noted to be a pleasant addition. pH shows a fairly Gaussian distribution, indicating that the total acidity is fairly normally distributed, and centered around a range that is similar to that of grapes. Later it will be worth looking at how correlated these variables are.

Density

Density is a major component of both swirling and sipping, and therefore may be a big factor in quality.

The peak density is below 1000 g/L, meaning that in general, the wine is less dense than pure water. This makes sense, since alcohol has a lower density than water and it generally makes up ~10% of the volume.

Alcohol content

Alcohol affects the swirl, sip, and savor components. Alcohol also has effects on physiological conditions, and could influence scoring beyond the 5 “s” steps.

This data shows that most of the wines have between 9 and 11% alcohol by volume, and there seems to be a positive skew.

Residual sugar content

Sugar affects swirl, sip, and savor.

Most wines fall below 4 g/L, indicating that this wine type is likely a dry wine. There is a long right tail, but all are significantly less than the threshold for a sweet wine, 45 g/L. The fermentation process converts sugar to alcohol, so if the starting sugar were constant, it would be reasonable for alcohol and residual sugar to have a negative correlation.

A boxplot explores these outliers even more. Generally, wines are called “dry” if they have less than 10 g/L of sugar, so it is likely hard to distinguish sugar content at these levels. An expert will likely be able to taste that several of these outliers are “off-dry”, but unless there is a clear trend in the bivariate plots, it likely isn’t important for these wines.

Sulfur Dioxide (preservatives)

Sulfur Dioxide (SO2) is a preservative of wine, and can affect the sniff and sip. Too little preservative can lead to spoiled wine, although it is unlikely that any of the tested wines were spoiled. Too much can give a sulfur taste, which is almost universally a bad taste for wine.

We see all three graphs have a right skew. All of these should be correlated, as total SO2 includes free SO2, and increased sulphates increases SO2.

Chlorides (salt)

It is unclear how much this affects quality, as I’ve never had a “salty” wine, but it can potentially manifest as a mineral, metallic, or savory taste, and influence texture - both components of sip.

There appears to be a normal distribution near 0.1 g/L, with a very long right tail. It’s hard to say if such low levels of chlorides have any appreciable effect on taste.

Univariate Analysis

Generally, we see that features are roughly normal with a right skew, with density being an exception with no discernible skew. The main feature, quality, shows that exceptional wines (both bad and good) are very uncommon. This can be due to the aggregation of the quality score (median of 3 or more tastings), psychological mean reversion due to contrast of qualitative measurements over time, or just an overall small difference between wines of this variety.

What is the structure of your dataset?

The data is structured so that each row is a single bottle of Vinho Verde wine. There is no information about grape variety, vintage, winery, brand, or price. Each wine has physicochemical measurements as well as a median aggregated measure of quality from 3+ tastings by wine experts.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality, and we are using the rest of the features to try and model quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The features of most interest to me are features that predominately affect sniff, or multiple of the five “S” steps of tasting. These include alcohol, sugar, acids, and sulfur dioxide.

Did you create any new variables from existing variables in the dataset?

I created a new alcohol variable to turn percent volume into g/L, but it is still linearly related to alcohol, so it is not necessarily a new variable in terms of linear regression.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I converted features, when possible, to all have the same units of g/L. Due to a right skew, many of the features can be scaled with log10, however I didn’t convert the raw numbers to log10 scale at this time.

Bivariate Plots Section

ggpairs analysis

First I can look at all the comparisons quickly with ggpairs, and then look closer with bigger plots.

Looking at the ggpairs (.pdf in the git folder), we can see a few trends. In the scatter plots, we can see a predictable correlation between the various acidities and the pH. It is also interesting to see trends between fixed acidity and density, and alcohol and density. This makes sense since those are the largest values in g/L, meaning they add the most mass to the liquid. Looking at the sugar boxplots, there doesn’t seem to be a trend with sugar content, and the outliers seem to be all pretty mediocre grades. This confirms my thought that sugar likely is an afterthought in this type of wine.

However, the question most central to this EDA is: what attributes affect the overall quality score of the wine? Looking at the boxplots for quality, we see there may be a downward trend with volatile acidity, where more acetic acid in the wine lowers quality - likely by affecting the smell of the wine. Conversely, there looks to be an upward trend with citric acid. Citric acid is noted to give the wine a “fresh” smell, which may boost quality. One more that stands out is that higher quality wines seem to have higher alcohol content - maybe a small change in the tipsiness of judges affects how they grade for quality?

Acids vs pH

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity + volatile.acidity + citric.acid and pH
## t = -37.418, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7087579 -0.6564574
## sample estimates:
##        cor 
## -0.6834838

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

The sum of the acid components is highly correlated with the pH, which makes sense, and makes pH a bit redundant. Looking at boxplots in ggpairs, the type of acid is more important than overall acidity, within the ranges we see in this dataset. It is noteworthy that it seems that fixed acidity is the driving factor in the sum, so if we seek to drop pH, we should include fixed acidity.
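The two tests above can be sketched with `cor.test` from base R's stats package. Since the raw file is not reproduced here, this sketch assumes a small stand-in data frame, `wine_head` (a hypothetical name), built from the six rows printed by `head(wine)` earlier; six rows are far too few to reproduce the reported coefficients and only show the shape of the call.

```r
# Stand-in data: the six rows printed by head(wine) above (not the full dataset)
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pearson correlation of the summed acids with pH
acid_sum_test <- with(wine_head,
                      cor.test(fixed.acidity + volatile.acidity + citric.acid, pH))

# Same test with fixed acidity alone, to see how much it drives the sum
fixed_only_test <- with(wine_head, cor.test(fixed.acidity, pH))

acid_sum_test$estimate
fixed_only_test$estimate
```

Running the same two calls on the full `wine` data frame should give output of the form shown above.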

Fixed acidity vs density

Next I look at the correlation between density and fixed acidity, the largest contribution of mass to the wine besides alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

We see a lot of correlation between these two, which makes sense since tartaric acid is the largest dissolved contribution to mass besides alcohol.

Alcohol vs density

Alcohol should also have a correlation with density, but it will be negative, since alcohol has less density than pure water.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Here we see that alcohol and density are negatively correlated, but there are other components that affect density as well. An ideal mix of ethanol and pure water would follow the line density = -2.11*(alcohol % by vol) + 1000, in g/L, but the points are all well above these values, indicating that there are heavier-than-water components in wine, like tartaric acid, sugar, etc.
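The reference line comes from a volume-weighted average of the two pure densities. The sketch below works this out, assuming ideal (additive-volume) mixing, which real ethanol-water mixtures only approximate, and compares it with the six rows printed by `head(wine)`. The function name `ideal_density` is an illustrative choice.

```r
rho_water   <- 1000  # g/L
rho_ethanol <- 789   # g/L, the value used for the unit conversion above

# Density of an ideal ethanol-water mix at a given alcohol % by volume.
# Algebraically this equals 1000 - 2.11 * abv_percent.
ideal_density <- function(abv_percent) {
  frac <- abv_percent / 100
  (1 - frac) * rho_water + frac * rho_ethanol
}

# Alcohol and density from the six rows printed by head(wine)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
density <- c(997.8, 996.8, 997.0, 998.0, 997.8, 997.8)

# Positive values are mass not explained by ethanol and water alone
excess <- density - ideal_density(alcohol)
excess
```

For these six wines the excess is on the order of 17 to 19 g/L, consistent with dissolved acids, sugar, and other solids adding mass.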

Total vs Free sulfur dioxide

Total sulfur dioxide (SO2) includes free SO2 in the measurement. It may be interesting to see what the ratio of free SO2 to bound SO2 is in wines.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide - free.sulfur.dioxide and free.sulfur.dioxide
## t = 18.771, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3841336 0.4644895
## sample estimates:
##       cor 
## 0.4251489

Here, the two forms are plotted against each other with a dashed unity line (which would indicate 50% free SO2), and 2 solid lines indicating rough bounds. The boundaries of the data seem to be factors of 8, meaning that at most, there is ~8 times as much free SO2 as bound, and vice versa. The equilibrium between the two forms is likely highly dependent on solution contents (and temperature), but it is not likely that there are equilibrium points that create more than 8x as much of one form than the other. We see that most wines have less than 50% of their SO2 in free form. There doesn’t seem to be an obvious trend, especially if the two outlying bound SO2 points are discounted.

Total sulfur dioxide and sulphates

We might expect these to be correlated, since sulphates are often added to wine to boost SO2. However, we might also see that they are uncorrelated, because adding sulphates may not be the only way to maintain SO2 levels in the wine.
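A minimal sketch of the bound SO2 and ratio calculation, again using only the six rows printed by `head(wine)` (in g/L after the unit conversion). The variable names are illustrative, and six rows cannot confirm the factor-of-8 bounds for the whole dataset.

```r
# Free and total SO2 from the six rows printed by head(wine), in g/L
free_so2  <- c(0.011, 0.025, 0.015, 0.017, 0.011, 0.013)
total_so2 <- c(0.034, 0.067, 0.054, 0.060, 0.034, 0.040)

# Total SO2 includes free SO2, so the bound part is the difference
bound_so2 <- total_so2 - free_so2

# A ratio of 1 sits on the unity line (50% free SO2);
# the rough bounds in the plot correspond to ratios of 1/8 and 8
free_to_bound <- free_so2 / bound_so2
inside_bounds <- free_to_bound >= 1 / 8 & free_to_bound <= 8

round(free_to_bound, 2)
```

All six ratios are below 1, in line with the observation that most wines have less than half of their SO2 in free form.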

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and sulphates
## t = 1.7178, df = 1597, p-value = 0.08602
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006087119  0.091774762
## sample estimates:
##        cor 
## 0.04294684

We see very little correlation between the two (r = 0.04, with a confidence interval that includes zero), and this might indicate that sulphates may not be the only way to maintain SO2, or that sulphates are only added to correct low SO2 levels, and not added uniformly.

Quality box plots

Importantly, we need to look at what features seem to vary with quality.

Boxplots are plotted over a jittered scatter of the data, with means represented with a contrasting star point. One trend that stands out in the ggpairs plots is that higher rated wines tend to have less volatile acid and more citric acid. Both of these components can affect the nose of a wine, with volatile acids presenting as vinegar, and citric acid presenting as freshness. I think that maybe the ratio of acids is more important than the total amounts. Noteworthy is how far the median citric acid value for quality-3 wines is below the mean. This means that there are a few big outliers, but most are near zero.

Alcohol shows a pretty enticing trend, with higher quality wines having higher alcohol content. At first glance, we may attribute this to some fractionally higher level of happiness that the expert may feel after drinking more alcohol, but it is important to also consider that changing the alcohol concentration in the solution may greatly affect the solubility of volatile compounds that affect the smell and taste of the wine.

Sulphates show a mild trend, maybe indicating that better wines use more SO2 to prevent even the slightest amount of spoilage or improper fermentation, but it is hard to say it is conclusive since there is such a small difference between levels, and there are a lot of outliers for middling wines.
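One way such a plot could be built is sketched below with ggplot2. This is an assumption about the approach rather than the exact code behind the figures; it uses the six rows printed by `head(wine)` as stand-in data and needs ggplot2 3.3.0 or later for the `fun` argument of `stat_summary`.

```r
library(ggplot2)

# Stand-in data: quality and volatile acidity from the six rows of head(wine)
wine_head <- data.frame(
  quality          = factor(c(5, 5, 5, 6, 5, 5)),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66)
)

p <- ggplot(wine_head, aes(x = quality, y = volatile.acidity)) +
  # raw points, spread horizontally so they do not overlap
  geom_jitter(width = 0.2, alpha = 0.4) +
  # semi-transparent boxplot drawn over the points
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  # group mean as a contrasting star (shape 8)
  stat_summary(fun = mean, geom = "point", shape = 8, size = 3, colour = "red")

p
```

Swapping `volatile.acidity` for another column gives the matching plot for that feature.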

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We see that quality is positively related to alcohol, citric acid, and sulphates, and seems to be negatively related to volatile acidity.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We see that the sum of acids is highly correlated with the pH (r = -0.68), which means that pH may be a redundant measurement. We also see a moderate negative correlation between alcohol and density (r = -0.50), and a strong positive correlation between fixed acid and density (r = 0.67).

What was the strongest relationship you found?

The strongest relationship I found was between the sum total of acids and the pH, with a correlation coefficient of -0.68. More mass of acids means lower pH.

Multivariate Plots Section

Volatile vs citric acid colored by quality

Above we saw that volatile and citric acid seemed to have opposite relationships with quality. Here we plot them against each other, colored by quality to see if any grouping falls out in the data.

Here the dashed unity line shows equal parts volatile and citric acid. Interestingly, we see some grouping of better quality wines that have more citric acid than volatile acid. Most interesting is that the bad wines seem to be grouped along the upper y axis, indicating they have little to no citric acid, and higher levels of volatile acid than most other wines.
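A base R sketch of this comparison follows, using the six rows printed by `head(wine)` as stand-in data. It draws the scatter with a dashed unity line and then computes, per quality score, the share of wines with more citric than volatile acid. With only six rows and two quality levels the numbers are illustrative, not a summary of the dataset.

```r
# Stand-in data: the six rows printed by head(wine)
wine_head <- data.frame(
  quality          = factor(c(5, 5, 5, 6, 5, 5)),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)
)

# Volatile acid on the y axis, citric acid on the x axis, colored by quality
plot(volatile.acidity ~ citric.acid, data = wine_head,
     col = as.integer(wine_head$quality), pch = 19)
abline(a = 0, b = 1, lty = 2)  # unity line: equal parts of both acids

# Share of wines below the unity line (more citric than volatile acid)
more_citric <- wine_head$citric.acid > wine_head$volatile.acidity
share_by_quality <- tapply(more_citric, wine_head$quality, mean)
share_by_quality
```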

Sulphates vs volatile acids

Considering how bunched the bad wines are, maybe a lot of volatile acids indicates some secondary fermentation that is usually stopped by sulphate preservatives.

Interestingly, we see some bunching again. Worse wines usually have high volatile acids as well as low sulphates. Maybe there is some credence to the idea that worse wines are not as carefully maintained, and have secondary fermentations that produce off-flavors.

Alcohol and citric acid

Alcohol and citric acid looked to both be positively related with quality.

We see some bunching of green in the top right, indicating that high alcohol combined with high citric acid usually means a better wine.

Modeling the data

I use ordinal regression, as we have a multiclass variable we are trying to describe, and the classes are ordered. This is done with the polr function.

I include all three acids, alcohol, and sulphates in the model.

## 
## Calls:
## m1: polr(formula = quality ~ citric.acid, data = wine, Hess = TRUE)
## m2: polr(formula = quality ~ citric.acid + fixed.acidity, data = wine, 
##     Hess = TRUE)
## m3: polr(formula = quality ~ citric.acid + fixed.acidity + volatile.acidity, 
##     data = wine, Hess = TRUE)
## m4: polr(formula = quality ~ citric.acid + fixed.acidity + volatile.acidity + 
##     alcohol, data = wine, Hess = TRUE)
## m5: polr(formula = quality ~ citric.acid + fixed.acidity + volatile.acidity + 
##     alcohol + sulphates, data = wine, Hess = TRUE)
## 
## ==============================================================================================
##                              m1            m2            m3            m4            m5       
## ----------------------------------------------------------------------------------------------
##   citric.acid               2.163***      2.414***     -0.380        -1.235**      -1.662***  
##                            (0.251)       (0.335)       (0.398)       (0.419)       (0.425)    
##   3|4                      -4.562***     -4.841***     -7.636***      3.397***      4.777***  
##                            (0.322)       (0.405)       (0.471)       (0.770)       (0.802)    
##   4|5                      -2.680***     -2.958***     -5.669***      5.322***      6.708***  
##                            (0.140)       (0.283)       (0.362)       (0.710)       (0.744)    
##   5|6                       0.431***      0.153        -2.303***      8.930***     10.379***  
##                            (0.083)       (0.258)       (0.321)       (0.710)       (0.749)    
##   6|7                       2.490***      2.214***     -0.084        11.665***     13.178***  
##                            (0.107)       (0.266)       (0.319)       (0.748)       (0.789)    
##   7|8                       5.150***      4.874***      2.627***     14.613***     16.167***  
##                            (0.251)       (0.350)       (0.389)       (0.797)       (0.839)    
##   fixed.acidity                          -0.041         0.055         0.191***      0.191***  
##                                          (0.037)       (0.038)       (0.041)       (0.041)    
##   volatile.acidity                                     -4.763***     -4.314***     -4.134***  
##                                                        (0.360)       (0.377)       (0.378)    
##   alcohol                                                             0.982***      0.984***  
##                                                                      (0.056)       (0.056)    
##   sulphates                                                                         2.210***  
##                                                                                    (0.323)    
## ----------------------------------------------------------------------------------------------
##   Aldrich-Nelson R-sq.      0.046         0.046         0.142         0.281         0.296     
##   McFadden R-sq.            0.020         0.020         0.070         0.165         0.178     
##   Cox-Snell R-sq.           0.047         0.047         0.153         0.324         0.344     
##   Nagelkerke R-sq.          0.051         0.052         0.168         0.357         0.379     
##   Likelihood-ratio         76.270        77.554       264.878       624.927       673.808     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1856.090     -1855.448     -1761.786     -1581.762     -1557.322     
##   Deviance               3712.181      3710.897      3523.573      3163.524      3114.643     
##   AIC                    3724.181      3724.897      3539.573      3181.524      3134.643     
##   BIC                    3756.444      3762.536      3582.590      3229.918      3188.415     
##   N                      1599          1599          1599          1599          1599         
## ==============================================================================================
##                  3   4   5   6   7   8
## modelEstimate                         
## 3                0   0   0   0   0   0
## 4                0   0   0   0   0   0
## 5                9  41 503 223  11   0
## 6                1  12 175 381 141  11
## 7                0   0   2  34  47   7
## 8                0   0   1   0   0   0

This model has a high misclassification rate, but seems to trend in the right direction. There are no gross misclassifications, save one “5” being judged as the sole “8”.
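The workflow above (nested `polr` fits followed by a cross-tabulation of predicted against actual classes) can be sketched as follows. To keep the block self-contained it runs on simulated data, so `toy`, its coefficients, and its three quality levels are all made up for illustration and say nothing about the real wines. `polr` comes from the MASS package.

```r
library(MASS)

set.seed(1)
n <- 500

# Simulated predictors and an ordered outcome (illustration only)
toy <- data.frame(alcohol          = rnorm(n, mean = 10.4, sd = 1),
                  volatile.acidity = runif(n, min = 0.2, max = 1.0))
latent <- 1.0 * toy$alcohol - 4.0 * toy$volatile.acidity + rlogis(n)
toy$quality <- cut(latent,
                   breaks = quantile(latent, c(0, 0.3, 0.7, 1)),
                   labels = c("5", "6", "7"),
                   include.lowest = TRUE, ordered_result = TRUE)

# Nested proportional-odds models; Hess = TRUE keeps the Hessian
# so that summary() can report standard errors
m1 <- polr(quality ~ alcohol, data = toy, Hess = TRUE)
m2 <- polr(quality ~ alcohol + volatile.acidity, data = toy, Hess = TRUE)

# Predicted class (rows) against actual class (columns)
tab <- table(modelEstimate = predict(m2), actual = toy$quality)
tab
```

On the real data the same pattern would use `wine` and the formulas listed in the model calls above.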

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

One of the most visually enticing relationships that I plotted was citric acid vs volatile acid, colored by quality. It seemed that broadly, if the wine had more citric acid than volatile acid, it was good, and high volatile acids in general led to worse wines.

A clear case of features strengthening each other can be made for citric acid and alcohol. It seems that the higher alcohol content wines with higher citric acid levels were generally better wines.

Were there any interesting or surprising interactions between features?

It is interesting, but not surprising, that high acetic (volatile) acid wines were generally worse. Our sense of smell is very sensitive to the vinegar smell, and it indicates the wine potentially has secondary fermentations that are usually undesirable. This is backed up somewhat by the plot of sulphates vs volatile acids. Potentially, there are other smells and flavors that are not measured that result from similar fermentations that increase volatile acids.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Interestingly, the model seems to compress the data even further, assigning nearly every wine to the 5 or 6 categories and predicting no 3s or 4s at all. It seems to not gravely mistake bad wines for good, or vice versa, so it is an intriguing starting point. The downfall of this model is likely the small sample size, particularly for especially good or bad wines. When there are only 18 out of 1599 wines that are graded as an “8”, it is hard to generalize. Interestingly, the only wine that the model classifies as an 8 is actually a 5.
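To put numbers on this, the cross-tabulation from the modeling section can be typed back in and summarized. The sketch below copies that table (rows are model estimates, columns are actual scores) and computes the share of exact matches and the share of predictions within one quality level. The object names are illustrative.

```r
scores <- as.character(3:8)

# Cross-tabulation copied from the model output above
conf <- matrix(c(0,  0,   0,   0,   0,  0,
                 0,  0,   0,   0,   0,  0,
                 9, 41, 503, 223,  11,  0,
                 1, 12, 175, 381, 141, 11,
                 0,  0,   2,  34,  47,  7,
                 0,  0,   1,   0,   0,  0),
               nrow = 6, byrow = TRUE,
               dimnames = list(modelEstimate = scores, actual = scores))

# Exact matches lie on the diagonal
exact_share <- sum(diag(conf)) / sum(conf)

# Predictions at most one quality level away from the actual score
levels_off <- abs(row(conf) - col(conf))
within_one_share <- sum(conf[levels_off <= 1]) / sum(conf)

round(c(exact = exact_share, within_one = within_one_share), 3)
```

About 58% of wines are classified exactly and about 97% land within one level, which matches the impression of many near misses but few gross errors.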


Final Plots and Summary

Plot One

Description One

Here we see that higher quality wines tend to have greater than a 1:1 ratio of citric acid to volatile acid. This is most evident when looking at poor wines, which tend to have no citric acid and high amounts of volatile acid. The linear fit for each quality type shows that volatile acid seems to be the important stratifier for quality. This plot shows the importance of these two acids to wine quality.

Plot Two

Description Two

Here we can see that bad wines have high acetic acid, and low sulphates. We also see that outliers with very high potassium sulphates tend to be poor if volatile acids are high. We can justify this by understanding that potassium sulphate is added to wine as a preservative, and if added in too low a quantity, or too late in the process, can cause secondary fermentations to occur. Great winemakers likely can add the minimal amount of sulphates to wine to prevent spoilage, while worse winemakers may over-react and add too much.

Plot Three

Description Three

Higher quality wines tend to have higher than average alcohol and citric acid content. This graph is important because it shows pretty nice stratification with the linear models. The lower quality wines have lower alcohol than higher quality wines. The negative slope in the highest quality group indicates a trend that lower alcohol wines can still be highly rated if they have higher citric acid content.


Reflection

This was a fun dataset that led to some interesting insights into the grading of wines. Both wine making and tasting are such complex processes that even small differences in physicochemical measurements can compound into large differences in the final quality of the product. Some trends, such as high acetic acid reducing the quality of wine, seem to agree with the conventional wisdom of even novice winetasters. Others, such as the positive association between alcohol content and quality, resonate with people who think that winetasting may all be psychosomatic - the perceived ability to contrast and describe wines matters much more to the satisfaction and quality of wine than any actual small physicochemical difference.

This dataset, however, suffered from being small on multiple fronts. The quality metric was compressed and did not span the entire range of values. Modeling the data was hamstrung by the fact that a vast majority of wines were “okay”, very few were “good” or “bad”, and none were “exceptional” or “terrible”.

I feel that the measurements lacked at least one crucial component in wine: tannins. Additionally, after researching Vinho Verde, I believe that lactic acid and CO2 are two components that would be crucial to measure. And finally, it would have been a much more multifaceted and interesting analysis if there were some more qualitative measures, like categorization of tasting notes. Is there a threshold of acetic acid before a taster notes vinegar? How finely can an expert distinguish alcohol content? What acid balance is associated with different fruit comparisons?

However, the constraint of the dataset also made it easier to approach. Measuring a single type of wine, irrespective of grape varietal, growth region, vintage, etc. allows for a more focused approach without an explosion of factors.

Future work with this type of data, I think, is very much dependent on gathering more data! Describing qualitative feelings with quantitative measures is a major end goal of analytics, and wine tasting is an endless source of this data; it just needs to be collected.